TITLE¶

  • Program Name: IIMK - Data Science and Artificial Intelligence
  • Project Title: Why Do Employees Quit? (Employee Attrition Analysis & Prediction)
  • Project By: Vivek Tanaji Kulthe
  • Progress State: Week 6

BACKGROUND¶

Organization Background:

ABC Technologies is facing a concerning trend of high employee attrition even though it offers a competitive salary and benefits package.

Problem Statement:

This suggests the root cause of attrition lies beyond financial compensation. The company wants to identify the underlying factors driving employee departures so it can improve retention and maintain a strong talent pool.

Challenges Faced:

  • Project Delays
  • Reputational Damage
  • Low Employee Morale
  • Recruitment Cost Increase
  • Training Cost Increase

OBJECTIVE¶

  • Analyze & Visualize the employee data to uncover trends and patterns related to employee attrition

  • Identify key attributes associated with departing employees, namely:

    • Demographic Factors
    • Compensation and Benefits
    • Work Environment
    • Job Satisfaction
    • Managerial Influence
    • Career Progression
  • Build a classification model to predict whether an employee is at risk of attrition

  • Analyze model performance to identify the most significant predictors of attrition

PYTHON LIBRARIES¶

In [94]:
# Importing sys and os
# OS module is responsible for interacting with the operating system, providing access to the underlying interface of the operating system
# SYS module is responsible for the interaction between the program and the Python interpreter
import sys
import os

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

from google.colab import drive
drive.mount('/content/drive', force_remount=True)

#pd.options.display.max_columns = None

# The warnings filter controls whether warnings are ignored, displayed, or turned into errors (raising an exception)
# 'ignore' = never print matching warnings
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
pd.set_option('display.max_columns', 50)
import numpy as np
import re
from scipy import stats

import seaborn as sns
sns.set(rc = {'figure.figsize' : (14, 10)})
sns.set_style('whitegrid')
sns.color_palette('dark')


import matplotlib as mpl
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
from matplotlib import rcParams
rcParams['lines.linewidth'] = 2
rcParams['xtick.labelsize'] = 9
rcParams['ytick.labelsize'] = 9
rcParams['axes.labelsize'] = 9
rcParams['axes.titlesize'] = 10
rcParams['axes.titlepad'] = 10.0
rcParams['figure.dpi'] = 300
rcParams['savefig.dpi'] = 300



import sklearn as sk
from sklearn.model_selection import train_test_split #, GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import ExtraTreesClassifier

from sklearn.preprocessing import MinMaxScaler, StandardScaler
# from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix
# from sklearn.linear_model import LogisticRegression
# from sklearn.neighbors import KNeighborsClassifier
# from sklearn.tree import DecisionTreeClassifier
# from sklearn.svm import SVC
# from sklearn.ensemble import RandomForestClassifier
# import xgboost as xgb
# from xgboost import XGBClassifier

# import imblearn as im
# from imblearn import under_sampling, over_sampling


# set a custom font for all plots
font_path = '/content/drive/MyDrive/Colab Notebooks/resources/BentonSans-Regular.ttf'
import matplotlib.font_manager as fm
fm.fontManager.addfont(font_path)
plt.rc('font', family='BentonSans')
plt.figure(facecolor='white')

#check the version
print('python', sys.version)
for pkg in [pd, np, mpl, sns, sk]:  #, im, xgb
   print(pkg.__name__, pkg.__version__)
Mounted at /content/drive
python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
pandas 2.0.3
numpy 1.25.2
matplotlib 3.7.1
seaborn 0.13.1
sklearn 1.2.2
<Figure size 4200x3000 with 0 Axes>

OBTAINING THE DATA : (O OF OSEMN)¶

⛁ Gather The Data From Relevant Sources¶

Dataset Used¶

  • The dataset used is taken from Kaggle and contains HR analytics data on employees who stay and employees who leave

  • The types of data include metrics such as Education, Environment Satisfaction, Job Involvement, Job Satisfaction, Performance Rating, Relationship Satisfaction, Work-Life Balance, Commute Distance, etc.

  • Asset: Employee Attrition Data

  • License: Open Database License / Database Content License

  • Source Link: Kaggle

Reading The File

In [3]:
gd_path = '/content/drive/MyDrive/Colab Notebooks'
fname = 'dataset/WA_Fn-UseC_-HR-Employee-Attrition.csv'
data = f"{gd_path}/{fname}"
df = pd.read_csv(data)

Taking a glance at the data

In [4]:
# Take a glance at the first 10 rows to quickly verify that the data loaded correctly
df.head(10)
Out[4]:
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount EmployeeNumber EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel JobRole JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate NumCompaniesWorked Over18 OverTime PercentSalaryHike PerformanceRating RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
0 41 Yes Travel_Rarely 1102 Sales 1 2 Life Sciences 1 1 2 Female 94 3 2 Sales Executive 4 Single 5993 19479 8 Y Yes 11 3 1 80 0 8 0 1 6 4 0 5
1 49 No Travel_Frequently 279 Research & Development 8 1 Life Sciences 1 2 3 Male 61 2 2 Research Scientist 2 Married 5130 24907 1 Y No 23 4 4 80 1 10 3 3 10 7 1 7
2 37 Yes Travel_Rarely 1373 Research & Development 2 2 Other 1 4 4 Male 92 2 1 Laboratory Technician 3 Single 2090 2396 6 Y Yes 15 3 2 80 0 7 3 3 0 0 0 0
3 33 No Travel_Frequently 1392 Research & Development 3 4 Life Sciences 1 5 4 Female 56 3 1 Research Scientist 3 Married 2909 23159 1 Y Yes 11 3 3 80 0 8 3 3 8 7 3 0
4 27 No Travel_Rarely 591 Research & Development 2 1 Medical 1 7 1 Male 40 3 1 Laboratory Technician 2 Married 3468 16632 9 Y No 12 3 4 80 1 6 3 3 2 2 2 2
5 32 No Travel_Frequently 1005 Research & Development 2 2 Life Sciences 1 8 4 Male 79 3 1 Laboratory Technician 4 Single 3068 11864 0 Y No 13 3 3 80 0 8 2 2 7 7 3 6
6 59 No Travel_Rarely 1324 Research & Development 3 3 Medical 1 10 3 Female 81 4 1 Laboratory Technician 1 Married 2670 9964 4 Y Yes 20 4 1 80 3 12 3 2 1 0 0 0
7 30 No Travel_Rarely 1358 Research & Development 24 1 Life Sciences 1 11 4 Male 67 3 1 Laboratory Technician 3 Divorced 2693 13335 1 Y No 22 4 2 80 1 1 2 3 1 0 0 0
8 38 No Travel_Frequently 216 Research & Development 23 3 Life Sciences 1 12 4 Male 44 2 3 Manufacturing Director 3 Single 9526 8787 0 Y No 21 4 2 80 0 10 2 3 9 7 1 8
9 36 No Travel_Rarely 1299 Research & Development 27 3 Medical 1 13 3 Male 94 3 2 Healthcare Representative 3 Married 5237 16577 6 Y No 13 3 2 80 2 17 3 2 7 7 7 7
In [5]:
# create a backup of original dataset for future use
df_raw = df.copy()

Understanding The Dataframe

In [6]:
# Let's see number of columns, column labels, column data types, memory usage, range index, and non-null information about DataFrame
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   int64 
 14  JobLevel                  1470 non-null   int64 
 15  JobRole                   1470 non-null   object
 16  JobSatisfaction           1470 non-null   int64 
 17  MaritalStatus             1470 non-null   object
 18  MonthlyIncome             1470 non-null   int64 
 19  MonthlyRate               1470 non-null   int64 
 20  NumCompaniesWorked        1470 non-null   int64 
 21  Over18                    1470 non-null   object
 22  OverTime                  1470 non-null   object
 23  PercentSalaryHike         1470 non-null   int64 
 24  PerformanceRating         1470 non-null   int64 
 25  RelationshipSatisfaction  1470 non-null   int64 
 26  StandardHours             1470 non-null   int64 
 27  StockOptionLevel          1470 non-null   int64 
 28  TotalWorkingYears         1470 non-null   int64 
 29  TrainingTimesLastYear     1470 non-null   int64 
 30  WorkLifeBalance           1470 non-null   int64 
 31  YearsAtCompany            1470 non-null   int64 
 32  YearsInCurrentRole        1470 non-null   int64 
 33  YearsSinceLastPromotion   1470 non-null   int64 
 34  YearsWithCurrManager      1470 non-null   int64 
dtypes: int64(26), object(9)
memory usage: 402.1+ KB

From the information above, we can see that:

  • The dataset has a total of 35 columns and 1470 rows
  • There are no null or missing values in each column
  • The attrition column is a classification target with the data type object
  • The other columns are features with data types int64 (26 columns) and object (8 columns)
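Since the target is a binary `object` column, it is worth checking its class balance before modelling. A minimal, self-contained sketch below uses a small toy frame in place of the real `df` (the `AttritionFlag` name is illustrative, not a column from the dataset):

```python
import pandas as pd

# toy stand-in for the Attrition column (the real df is loaded from the CSV above)
toy = pd.DataFrame({'Attrition': ['Yes', 'No', 'No', 'No', 'Yes', 'No']})

# share of each class; on the real data roughly 16% of rows are 'Yes'
balance = toy['Attrition'].value_counts(normalize=True)
print(balance)

# map Yes/No to 1/0 for later modelling
toy['AttritionFlag'] = toy['Attrition'].map({'Yes': 1, 'No': 0})
print(toy['AttritionFlag'].sum())  # number of leavers in the toy frame
```

On the real dataset this kind of check reveals the class imbalance that motivates the (commented-out) `imblearn` imports above.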

Understanding The Data (Data Level and Types)¶

Qualitative Data Types¶

Nominal Data Types (Identify & Mark)¶

In [7]:
# Qualitative - Nominal (N)
qual_n = [
    'Attrition', # target
    'BusinessTravel',
    'Department',
    'EducationField',
    'EmployeeNumber', # id
    'Gender',
    'JobRole',
    'MaritalStatus',
    'Over18',
    'OverTime',
]

Ordinal Data Types (Identify & Mark)¶

In [8]:
# Qualitative - Ordinal (O)

qual_o = [
    'Education', # desc
    'EnvironmentSatisfaction', # desc
    'JobInvolvement', # desc
    'JobLevel',
    'JobSatisfaction', # desc
    'PerformanceRating', # desc
    'RelationshipSatisfaction', # desc
    'StockOptionLevel',
    'WorkLifeBalance', # desc
]
In [9]:
#Combined Qualitative Data Set (Nominal + Ordinal)
qual = qual_n + qual_o
df_qual = df[qual]
df_qual.head(10)
Out[9]:
Attrition BusinessTravel Department EducationField EmployeeNumber Gender JobRole MaritalStatus Over18 OverTime Education EnvironmentSatisfaction JobInvolvement JobLevel JobSatisfaction PerformanceRating RelationshipSatisfaction StockOptionLevel WorkLifeBalance
0 Yes Travel_Rarely Sales Life Sciences 1 Female Sales Executive Single Y Yes 2 2 3 2 4 3 1 0 1
1 No Travel_Frequently Research & Development Life Sciences 2 Male Research Scientist Married Y No 1 3 2 2 2 4 4 1 3
2 Yes Travel_Rarely Research & Development Other 4 Male Laboratory Technician Single Y Yes 2 4 2 1 3 3 2 0 3
3 No Travel_Frequently Research & Development Life Sciences 5 Female Research Scientist Married Y Yes 4 4 3 1 3 3 3 0 3
4 No Travel_Rarely Research & Development Medical 7 Male Laboratory Technician Married Y No 1 1 3 1 2 3 4 1 3
5 No Travel_Frequently Research & Development Life Sciences 8 Male Laboratory Technician Single Y No 2 4 3 1 4 3 3 0 2
6 No Travel_Rarely Research & Development Medical 10 Female Laboratory Technician Married Y Yes 3 3 4 1 1 4 1 3 2
7 No Travel_Rarely Research & Development Life Sciences 11 Male Laboratory Technician Divorced Y No 1 4 3 1 3 4 2 1 3
8 No Travel_Frequently Research & Development Life Sciences 12 Male Manufacturing Director Single Y No 3 4 2 3 3 4 2 0 3
9 No Travel_Rarely Research & Development Medical 13 Male Healthcare Representative Married Y No 3 3 3 2 3 3 2 2 2
In [10]:
# Let's check the data types for identified Nominal Data
df[qual_n].dtypes
Out[10]:
Attrition         object
BusinessTravel    object
Department        object
EducationField    object
EmployeeNumber     int64
Gender            object
JobRole           object
MaritalStatus     object
Over18            object
OverTime          object
dtype: object
In [11]:
# Let's check the data types for identified Ordinal Data (if there is need to decode the columns)
df[qual_o].dtypes
Out[11]:
Education                   int64
EnvironmentSatisfaction     int64
JobInvolvement              int64
JobLevel                    int64
JobSatisfaction             int64
PerformanceRating           int64
RelationshipSatisfaction    int64
StockOptionLevel            int64
WorkLifeBalance             int64
dtype: object
In [12]:
df[qual].dtypes
Out[12]:
Attrition                   object
BusinessTravel              object
Department                  object
EducationField              object
EmployeeNumber               int64
Gender                      object
JobRole                     object
MaritalStatus               object
Over18                      object
OverTime                    object
Education                    int64
EnvironmentSatisfaction      int64
JobInvolvement               int64
JobLevel                     int64
JobSatisfaction              int64
PerformanceRating            int64
RelationshipSatisfaction     int64
StockOptionLevel             int64
WorkLifeBalance              int64
dtype: object

Quantitative Data Types¶

Interval Data Types (Identify & Mark)¶

In [13]:
# Quantitative - Interval (I)
quant_i = [
    'Age',
    'DistanceFromHome',
    'EmployeeCount', # 1
    'NumCompaniesWorked',
    'StandardHours', # 1
    'TrainingTimesLastYear'
]

Ratio Data Types (Identify & Mark)¶

In [14]:
# Quantitative Ratio (R)
quant_r = [
    'DailyRate',
    'HourlyRate',
    'MonthlyRate',
    'PercentSalaryHike', # percentage
    'TotalWorkingYears',
    'MonthlyIncome',
    'YearsAtCompany',
    'YearsInCurrentRole',
    'YearsSinceLastPromotion',
    'YearsWithCurrManager'
]
In [15]:
quant = quant_i + quant_r
df_quant = df[quant]
df_quant.head(3)
Out[15]:
Age DistanceFromHome EmployeeCount NumCompaniesWorked StandardHours TrainingTimesLastYear DailyRate HourlyRate MonthlyRate PercentSalaryHike TotalWorkingYears MonthlyIncome YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
0 41 1 1 8 80 0 1102 94 19479 11 8 5993 6 4 0 5
1 49 8 1 1 80 3 279 61 24907 23 10 5130 10 7 1 7
2 37 2 1 6 80 3 1373 92 2396 15 7 2090 0 0 0 0

SCRUBBING THE DATA: (S OF OSEMN)¶

🗑 Clean data to formats that machine understands¶

Data Cleansing¶

In [16]:
# copying the raw dataset
df_clean = df_raw.copy()
  • df_raw = raw dataset
  • df_clean = dataset that has been cleaned (drop columns, drop rows, encoding, etc.)

Note: Use df_clean for any cleansing steps so that the raw data remains untouched.

Handling Single & Unique Data¶

Note:

  • "Single" data here means a column that has only one unique value across all rows.
  • "Unique" data here means a column where every row contains a distinct value (such as EmployeeNumber).
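Both checks can also be done programmatically rather than by reading the `nunique()` output by eye. A minimal sketch, using a toy frame in place of `df_clean`:

```python
import pandas as pd

# toy frame with one single-valued column and one all-unique id column
toy = pd.DataFrame({
    'EmployeeCount':  [1, 1, 1],     # single: only one unique value
    'EmployeeNumber': [10, 11, 12],  # unique: distinct in every row
    'Age':            [41, 49, 41],
})

n = toy.nunique()
single_cols = n[n == 1].index.tolist()          # candidates to drop: no information
unique_cols = n[n == len(toy)].index.tolist()   # candidates to drop: pure identifiers
print(single_cols, unique_cols)  # → ['EmployeeCount'] ['EmployeeNumber']
```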
In [17]:
df_clean.nunique()
Out[17]:
Age                           43
Attrition                      2
BusinessTravel                 3
DailyRate                    886
Department                     3
DistanceFromHome              29
Education                      5
EducationField                 6
EmployeeCount                  1
EmployeeNumber              1470
EnvironmentSatisfaction        4
Gender                         2
HourlyRate                    71
JobInvolvement                 4
JobLevel                       5
JobRole                        9
JobSatisfaction                4
MaritalStatus                  3
MonthlyIncome               1349
MonthlyRate                 1427
NumCompaniesWorked            10
Over18                         1
OverTime                       2
PercentSalaryHike             15
PerformanceRating              2
RelationshipSatisfaction       4
StandardHours                  1
StockOptionLevel               4
TotalWorkingYears             40
TrainingTimesLastYear          7
WorkLifeBalance                4
YearsAtCompany                37
YearsInCurrentRole            19
YearsSinceLastPromotion       16
YearsWithCurrManager          18
dtype: int64

Observation:

  • EmployeeCount, Over18, StandardHours are columns that only have 1 unique value.
  • EmployeeNumber is a column whose entire row contains a unique value.
In [18]:
bef = len(df_clean.columns) # number of columns before removal
df_clean = df_clean.drop(['EmployeeCount', 'Over18', 'StandardHours', 'EmployeeNumber'], axis = 1) # columns dropped
aft = len(df_clean.columns) # number of columns after removal
print(f"The number of columns is reduced from {bef} to {aft}.")
The number of columns is reduced from 35 to 31.

Observation:

  • The number of columns is reduced from 35 to 31

Handling Missing Data¶

In [19]:
df_clean.isnull().sum()
Out[19]:
Age                         0
Attrition                   0
BusinessTravel              0
DailyRate                   0
Department                  0
DistanceFromHome            0
Education                   0
EducationField              0
EnvironmentSatisfaction     0
Gender                      0
HourlyRate                  0
JobInvolvement              0
JobLevel                    0
JobRole                     0
JobSatisfaction             0
MaritalStatus               0
MonthlyIncome               0
MonthlyRate                 0
NumCompaniesWorked          0
OverTime                    0
PercentSalaryHike           0
PerformanceRating           0
RelationshipSatisfaction    0
StockOptionLevel            0
TotalWorkingYears           0
TrainingTimesLastYear       0
WorkLifeBalance             0
YearsAtCompany              0
YearsInCurrentRole          0
YearsSinceLastPromotion     0
YearsWithCurrManager        0
dtype: int64

Observation:

  • There are no null (missing) values in any column of the dataframe

Handling Duplicated Data¶

In [20]:
df_clean.duplicated().sum()
Out[20]:
0

Observation:

  • There are no duplicate rows in the dataset

Handling Outliers¶

In [21]:
df_quant.describe()
Out[21]:
Age DistanceFromHome EmployeeCount NumCompaniesWorked StandardHours TrainingTimesLastYear DailyRate HourlyRate MonthlyRate PercentSalaryHike TotalWorkingYears MonthlyIncome YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
count 1470.000000 1470.000000 1470.0 1470.000000 1470.0 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000
mean 36.923810 9.192517 1.0 2.693197 80.0 2.799320 802.485714 65.891156 14313.103401 15.209524 11.279592 6502.931293 7.008163 4.229252 2.187755 4.123129
std 9.135373 8.106864 0.0 2.498009 0.0 1.289271 403.509100 20.329428 7117.786044 3.659938 7.780782 4707.956783 6.126525 3.623137 3.222430 3.568136
min 18.000000 1.000000 1.0 0.000000 80.0 0.000000 102.000000 30.000000 2094.000000 11.000000 0.000000 1009.000000 0.000000 0.000000 0.000000 0.000000
25% 30.000000 2.000000 1.0 1.000000 80.0 2.000000 465.000000 48.000000 8047.000000 12.000000 6.000000 2911.000000 3.000000 2.000000 0.000000 2.000000
50% 36.000000 7.000000 1.0 2.000000 80.0 3.000000 802.000000 66.000000 14235.500000 14.000000 10.000000 4919.000000 5.000000 3.000000 1.000000 3.000000
75% 43.000000 14.000000 1.0 4.000000 80.0 3.000000 1157.000000 83.750000 20461.500000 18.000000 15.000000 8379.000000 9.000000 7.000000 3.000000 7.000000
max 60.000000 29.000000 1.0 9.000000 80.0 6.000000 1499.000000 100.000000 26999.000000 25.000000 40.000000 19999.000000 40.000000 18.000000 15.000000 17.000000
In [22]:
df_quant.shape[0]
Out[22]:
1470
In [102]:
plt.figure(figsize = (15, 8))

for i in range(0, len(quant)):
    plt.subplot(2, 8, i+1)
    sns.boxplot(y = df[quant[i]], color = '#005495', orient = 'v')
    plt.title(quant[i], fontweight = 'bold')

plt.tight_layout(pad = 2.0)
plt.show()

From the box plot above, we can see that:

  • There are quite a lot of outliers in the column:
    • TotalWorkingYears
    • TrainingTimesLastYear
    • YearsAtCompany
    • YearsInCurrentRole
    • YearsSinceLastPromotion
    • YearsWithCurrManager
    • MonthlyIncome
  • Most of the columns containing outliers are tenure-related (years) columns, meaning employees' experience levels vary widely.
In [24]:
# Considering the above chart, the columns from which outliers need to be removed are:
cols = ['TrainingTimesLastYear','TotalWorkingYears','MonthlyIncome','YearsAtCompany','YearsInCurrentRole','YearsSinceLastPromotion','YearsWithCurrManager']
In [101]:
# Before
plt.figure(figsize = (14, 4))
for i in range(0, len(cols)):
    plt.subplot(1, 7, i+1)
    sns.boxplot(y = df_clean[cols[i]], color = '#005495', orient = 'v')

plt.suptitle('Outliers Before Cleansing', y = 1.05, fontsize = 12)
plt.tight_layout(pad = 1.5)
plt.show()

Z-Score Method¶

In [26]:
print(f'Number of rows before removing outliers: {len(df_clean)}')

filtered_entries = np.array([True] * len(df_clean))

for col in cols:
    zscore = abs(stats.zscore(df_clean[col])) # calculate the absolute z-score
    filtered_entries = (zscore < 3) & filtered_entries # keep less than 3 absolute z-scores

df_ZSCORE = df_clean[filtered_entries] # filter, only take those whose z-score is below 3

print(f'Number of rows after removing outliers: {len(df_ZSCORE)}')
Number of rows before removing outliers: 1470
Number of rows after removing outliers: 1387
In [82]:
# After using Z-SCORE
plt.figure(figsize = (14, 4))
for i in range(0, len(cols)):
    plt.subplot(1, 7, i+1)
    sns.boxplot(y = df_ZSCORE[cols[i]], color = '#005495', orient = 'v')

plt.suptitle('Outliers After Cleansing by Z-Score', y = 1.05, fontsize = 12)
plt.tight_layout(pad = 1.5)
plt.show()

IQR (Inter-Quartile Range) Method¶

In [28]:
print(f'Number of rows before removing outliers: {len(df_clean)}')

filtered_entries = np.array([True] * len(df_clean))
for col in cols:
    Q1 = df_clean[col].quantile(0.25)
    Q3 = df_clean[col].quantile(0.75)
    IQR = Q3 - Q1
    low_limit = Q1 - (IQR * 1.5)
    high_limit = Q3 + (IQR * 1.5)

    filtered_entries = ((df_clean[col] >= low_limit) & (df_clean[col] <= high_limit)) & filtered_entries

df_IQR = df_clean[filtered_entries]

print(f'Number of rows after removing outliers: {len(df_IQR)}')
Number of rows before removing outliers: 1470
Number of rows after removing outliers: 1024
In [83]:
# After using IQR
plt.figure(figsize = (14, 4))
for i in range(0, len(cols)):
    plt.subplot(1, 7, i+1)
    sns.boxplot(y = df_IQR[cols[i]], color = '#005495', orient = 'v')

plt.suptitle('Outliers After Cleansing by IQR', y = 1.05, fontsize = 12)
plt.tight_layout(pad = 1.5)
plt.show()

Result¶

Z-Score

  • Number of rows before removing outliers: 1470
  • Number of rows after removing outliers: 1387
  • Number of rows removed: 83

IQR

  • Number of rows before removing outliers: 1470
  • Number of rows after removing outliers: 1024
  • Number of rows removed: 446

Observation:

  • The Z-score method is more suitable here because IQR filtering removes a much larger share of the data (446 of 1470 rows vs. 83).
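An alternative worth noting: instead of dropping rows, outliers can be clipped (winsorized) at the IQR fences, which keeps every row while taming extreme values. This is not applied in this project; a minimal sketch on a toy column:

```python
import pandas as pd

# toy column with one extreme value
s = pd.Series([1, 2, 2, 3, 3, 4, 40])

# IQR fences, as in the IQR method above
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

# clip instead of drop: the extreme 40 is pulled down to the upper fence
clipped = s.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
print(clipped.tolist())
```

Clipping preserves the sample size (useful when, as with IQR dropping here, too many rows would be lost), at the cost of distorting the tails.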
In [30]:
df_clean = df_ZSCORE.copy()
df_clean.head()
Out[30]:
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel JobRole JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate NumCompaniesWorked OverTime PercentSalaryHike PerformanceRating RelationshipSatisfaction StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
0 41 Yes Travel_Rarely 1102 Sales 1 2 Life Sciences 2 Female 94 3 2 Sales Executive 4 Single 5993 19479 8 Yes 11 3 1 0 8 0 1 6 4 0 5
1 49 No Travel_Frequently 279 Research & Development 8 1 Life Sciences 3 Male 61 2 2 Research Scientist 2 Married 5130 24907 1 No 23 4 4 1 10 3 3 10 7 1 7
2 37 Yes Travel_Rarely 1373 Research & Development 2 2 Other 4 Male 92 2 1 Laboratory Technician 3 Single 2090 2396 6 Yes 15 3 2 0 7 3 3 0 0 0 0
3 33 No Travel_Frequently 1392 Research & Development 3 4 Life Sciences 4 Female 56 3 1 Research Scientist 3 Married 2909 23159 1 Yes 11 3 3 0 8 3 3 8 7 3 0
4 27 No Travel_Rarely 591 Research & Development 2 1 Medical 1 Male 40 3 1 Laboratory Technician 2 Married 3468 16632 9 No 12 3 4 1 6 3 3 2 2 2 2

EXPLORING THE DATA (E OF OSEMN)¶

DESCRIPTIVE ANALYTICS¶

Calculate Business Metrics (COMPANY LEVEL ATTRITION RATE)¶

Calculate Attrition Rate for ABC Technologies

How to Calculate Employee Turnover Rate?
To calculate employee turnover rate, you’ll need to know, within a given time period, the number of employee separations and the average number of employees present. The average number of employees can be determined by adding together the number of employees on the first day and the last day of the time period, then dividing that sum total by two.


Then, plug the above values into the following turnover rate formula:

Turnover rate = [(# of employee separations) / (average # of employees)] x 100
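Plugging illustrative numbers into the formula (the headcounts below are assumptions for this worked example, not figures taken from the dataset):

```python
# worked example of the turnover-rate formula with illustrative headcounts
separations = 237               # employees who left during the period
headcount_start = 1425          # headcount on the first day (illustrative)
headcount_end = 1470            # headcount on the last day (illustrative)

avg_headcount = (headcount_start + headcount_end) / 2   # 1447.5
turnover_rate = separations / avg_headcount * 100
print(f"{turnover_rate:.2f}%")  # → 16.37%
```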
In [31]:
df_attrition = df[['Attrition', 'JobLevel', 'MonthlyIncome', 'YearsAtCompany']].copy()
df_attrition.head()
Out[31]:
Attrition JobLevel MonthlyIncome YearsAtCompany
0 Yes 2 5993 6
1 No 2 5130 10
2 Yes 1 2090 0
3 No 1 2909 8
4 No 1 3468 2

Add a column to identify if the employee joined this year

In [32]:
# is a new employee this year?
ls_jointhisyear = []
for idx, col in df_attrition.iterrows():
    if col['YearsAtCompany'] < 1:
        ls_jointhisyear.append(1) # worked for < 1 year
    else:
        ls_jointhisyear.append(0) # worked for >= 1 year

df_attrition['JoinThisYear'] = ls_jointhisyear
df_attrition.head()
Out[32]:
Attrition JobLevel MonthlyIncome YearsAtCompany JoinThisYear
0 Yes 2 5993 6 0
1 No 2 5130 10 0
2 Yes 1 2090 0 1
3 No 1 2909 8 0
4 No 1 3468 2 0

Calculate the Attrition Rate

In [33]:
employee_start = df_attrition.loc[df_attrition['JoinThisYear'] == 0].shape[0] # the beginning of 2020
employee_end = df_attrition.shape[0] # the end of 2020
employee_avg = (employee_start + employee_end) / 2 # average of employees number in 2020
attrition_total = df_attrition.loc[df_attrition['Attrition'] == 'Yes'].shape[0] # total attrition
turnover_rate = attrition_total / employee_avg # turnover rate

print("Employee Attrition in 2020")
print("--------------------------------------------")

# attrition number in 2020
print(f"Average Number of Employees = {employee_avg:.0f} employees")
print(f"Total Attrition Number = {attrition_total} employees")
# specified for each job level
for i in sorted(df_attrition['JobLevel'].unique()):
    attrition_level = df_attrition.loc[(df_attrition['Attrition'] == 'Yes') & (df_attrition['JobLevel'] == i)].shape[0]
    print(f"  - Job Level {i} = {attrition_level} employees")
print("--------------------------------------------")

# turnover rate in 2020
print(f"Attrition Rate = {turnover_rate:.2%}")
Employee Attrition in 2020
--------------------------------------------
Average Number of Employees = 1448 employees
Total Attrition Number = 237 employees
  - Job Level 1 = 143 employees
  - Job Level 2 = 52 employees
  - Job Level 3 = 32 employees
  - Job Level 4 = 5 employees
  - Job Level 5 = 5 employees
--------------------------------------------
Attrition Rate = 16.37%

Statistical¶

Graphical¶

Quantitative Feature Distribution (Histogram)¶

In [85]:
plt.figure(figsize = (18, 14))

for i in range(0, len(quant)):
    plt.subplot(4, 4, i+1)
    sns.histplot(df[quant[i]], color='navy', kde=True, stat="density")  # Add kde and stat
    plt.title(quant[i], fontsize = 14)
    plt.xlabel('')
    plt.ylabel('')

plt.tight_layout()
plt.show()

Qualitative Feature Distribution (Bar Plot)¶

In [93]:
plt.figure(figsize = (18, 12))

for i in range(0, len(qual)):
    plt.subplot(4, 5, i+1)
    sns.countplot(x = qual[i], data = df_qual, color = '#005495', orient = 'v')
    plt.title(qual[i], fontsize = 14)
    plt.xlabel('')
    plt.ylabel('')

plt.tight_layout()
plt.show()

Descriptive Analytics Observations¶

Observations:

The attrition rate of ABC Technologies is 16.37%

From the QUANTITATIVE attributes Descriptive Analytics, we can observe that:

  • Looking at the histograms above, the MonthlyIncome, PercentSalaryHike, and YearsAtCompany columns are skewed to the right (positively skewed)
  • This indicates that employee salaries cluster toward the low end and that many employees have relatively short tenure at ABC Technologies.

Apart from those columns, the distributions look roughly normal

From the QUALITATIVE attributes Descriptive Analytics, we can observe that:

  • The majority of employees (around 1,000) travel rarely on business.
  • Research & Development is the department with the largest number of employees, at around 900.
  • The workforce is predominantly male, with around 880 male employees.
  • Married employees make up the largest marital-status group
  • Only a small number of employees work overtime
  • Employees with bachelor's degrees form the largest education group
  • Only a small number of employees show outstanding performance (rating 4), and no employee has a low performance rating (1: low)

DIAGNOSTIC ANALYTICS¶

Numerical & Categorical Ordinal Correlation (Heatmap)¶

In [87]:
quant_ord = quant + qual_o # quantitative + qualitative ordinal
plt.figure(figsize = (15, 15))
sns.heatmap(df[quant_ord].corr(), cmap = 'YlGnBu', annot = True, annot_kws={"size": 8}, fmt = '.2f')
plt.show()

Observations:

  • PercentSalaryHike and PerformanceRating have a fairly strong positive relationship

  • TotalWorkingYears has a fairly strong positive relationship with Age, MonthlyIncome, and JobLevel

  • YearsAtCompany has a fairly strong positive relationship with YearsInCurrentRole and YearsWithCurrManager

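Beyond eyeballing the heatmap, the features can be ranked numerically by correlating each one with a 1/0 attrition flag. A minimal sketch with a toy frame standing in for `df[quant_ord]` (the `AttritionFlag` column is illustrative, derived from the Yes/No target):

```python
import pandas as pd

# toy numeric frame plus a 1/0 attrition flag (illustrative values)
toy = pd.DataFrame({
    'Age':           [41, 49, 37, 33, 27, 59],
    'MonthlyIncome': [5993, 5130, 2090, 2909, 3468, 2670],
    'AttritionFlag': [1, 0, 1, 0, 0, 0],
})

# Pearson correlation of each feature with the target, strongest magnitude first
corr = (toy.drop(columns='AttritionFlag')
           .corrwith(toy['AttritionFlag'])
           .sort_values(key=abs, ascending=False))
print(corr)
```

On the real data, sorting by absolute correlation like this gives a quick first cut at which numeric features move with attrition, complementing the KDE plots below.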

Attrition Num of Each Quantitative Feature (KDE Plot)¶

In [88]:
# check the number of quantitative features
print(f"The amount of quantitative data = {len(quant)}")
The amount of quantitative data = 16
In [90]:
plt.figure(figsize = (18, 14))

for i in range(0, len(quant)):
    plt.subplot(4, 4, i+1)
    sns.kdeplot(x = quant[i], hue = 'Attrition', data = df, palette = 'colorblind', linewidth = 1, fill = True, legend = True)

    plt.axvline(df[quant[i]].loc[df['Attrition'] == 'No'].median(), color = '#5088bd', linewidth = 0.8, linestyle = '-.', label = 'No')
    plt.axvline(df[quant[i]].loc[df['Attrition'] == 'Yes'].median(), color = '#005495', linewidth = 0.8, linestyle = '-.', label = 'Yes')

    plt.title(quant[i], fontsize = 12)
    plt.xlabel('')
    plt.ylabel('')
    plt.legend(['No (Median)', 'Yes (Median)', 'Attrition: No', 'Attrition: Yes',], ncol = 2, loc = 'upper center')

plt.tight_layout(pad = 2.0)
plt.show()

Attrition Num & Rate of Each Qualitative Feature (Bar Plot)¶

In [91]:
# check the amount of qualitative data
print(f"The amount of qualitative data = {len(qual)-1}") # except attrition
The amount of qualitative data = 18
In [92]:
plt.figure(figsize = (20, 30))

k = 0
for i in range(1, len(qual)):
    # attrition num
    plt.subplot(9, 4, i+k)
    sns.countplot(x = qual[i], hue = 'Attrition', data = df_qual, palette = 'colorblind', orient = 'v')

    plt.title(f"{qual[i]}: Attrition Num", fontsize = 14)
    plt.xlabel('')
    plt.ylabel('')

    # attrition rate
    df_temp = df_qual.groupby([qual[i], 'Attrition']).size().reset_index().rename({0 : 'Size'}, axis = 1)
    df_temp['TotalSize'] = df_temp.groupby([qual[i]])['Size'].transform(sum)
    df_temp['AttritionRate'] = df_temp['Size'] / df_temp['TotalSize']

    df_temp = df_temp.loc[df_temp['Attrition'] == 'Yes'].reset_index(drop = True)
    df_temp = df_temp.sort_values(by = 'AttritionRate', ascending = False)

    plt.subplot(9, 4, i+k+1)
    sns.barplot(y = qual[i], x = 'AttritionRate', data = df_temp, palette = 'colorblind', orient = 'h')

    plt.title(f"{qual[i]}: Attrition Rate", fontsize = 14)
    plt.xlabel('')
    plt.ylabel('')
    plt.xlim(0.0, 0.5)

    k += 1

plt.tight_layout(w_pad = 0.5)
plt.show()

Diagnostic Analytics Observations¶

Career Progression

Attributes: JobLevel, PerformanceRating, TotalWorkingYears, TrainingTimesLastYear, YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion

Observations:

  • Employees with short tenure (YearsAtCompany of 0-5 years) leave noticeably more often than those who have stayed with the company for more than 10 years.

Compensation and Benefits

Attributes: DailyRate, HourlyRate, MonthlyIncome, MonthlyRate, PercentSalaryHike, StockOptionLevel

Observations:

  • The rate distributions (HourlyRate, DailyRate, MonthlyRate) of employees who leave differ little from those who stay, suggesting these rates are not important drivers of attrition.
  • Attrition is concentrated among employees with low monthly income (roughly USD 0-5,000).

Demographic Factors

Attributes: Age, Education, EducationField, Gender, MaritalStatus, NumCompaniesWorked, Over18

Observations:

  • Employees who leave tend to be younger (20-35 years) than those who remain.
  • More men than women leave the company (around 150 men in total).
  • Single employees show a noticeably higher attrition rate and count than married or divorced employees.

Job Satisfaction

Attributes: JobInvolvement, JobRole, JobSatisfaction, RelationshipSatisfaction

Observations:

  • Turnover among Sales Representatives is very high compared to other roles, followed by Laboratory Technicians and Human Resources.
  • The higher the JobInvolvement score, the lower the attrition rate.

Managerial Influence

Attribute: YearsWithCurrManager

Observations:

  • We could not observe any specific pattern with respect to this attribute.

Work Environment

Attributes: BusinessTravel, Department, DistanceFromHome, EmployeeCount, EnvironmentSatisfaction, OverTime, StandardHours, WorkLifeBalance

Observations:

  • Around 150 employees who travel frequently for business left the company; their attrition rate is the highest compared to those who travel rarely or not at all.
  • The Sales Department has the highest attrition rate (about 20%), but the R&D Department loses more employees in absolute numbers, so it incurs greater total turnover cost.
  • Human Resources has the lowest attrition count.
  • Employees who work overtime have an attrition rate of roughly 30%, about three times that of those who do not, making overtime-related turnover particularly costly.
  • Employees with low EnvironmentSatisfaction have a fairly high attrition rate.
  • There may be an interaction between frequent business travel, the Sales Representative role, and overtime.
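Several of these observations are rate comparisons; they can be spot-checked with `pd.crosstab` using `normalize='index'`. A minimal sketch on a toy frame (the column names are assumed to match the dataset):

```python
import pandas as pd

# Toy stand-in for the real dataset; column names assumed.
df_toy = pd.DataFrame({
    'OverTime':  ['Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'No', 'No'],
    'Attrition': ['Yes', 'Yes', 'No',  'No', 'No', 'No', 'No', 'Yes'],
})

# Row-normalized crosstab: attrition rate within each OverTime group.
rates = pd.crosstab(df_toy['OverTime'], df_toy['Attrition'], normalize='index')
print(rates['Yes'])
```

On the real data, the same call would reproduce the roughly 3x gap between overtime and non-overtime attrition rates noted above.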

Feature Engineering¶

Create New Features¶

In [41]:
# Add "Group Age" Feature
df_clean['GroupAge'] = np.where(df_clean['Age'] <= 30, '18-30',
                                np.where((df_clean['Age'] > 30) & (df_clean['Age'] <= 40), '31-40',
                                         np.where((df_clean['Age'] > 40) & (df_clean['Age'] <= 50), '41-50', '51-60')))
df_clean[['Age', 'GroupAge']]
Out[41]:
Age GroupAge
0 41 41-50
1 49 41-50
2 37 31-40
3 33 31-40
4 27 18-30
... ... ...
1465 36 31-40
1466 39 31-40
1467 27 18-30
1468 49 41-50
1469 34 31-40

1387 rows × 2 columns
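The nested np.where calls above work, but the same binning reads more cleanly with pd.cut. A sketch using the same bin edges (the Series here is illustrative):

```python
import pandas as pd

ages = pd.Series([41, 49, 37, 33, 27, 55])

# Right-closed bins matching the np.where logic: <=30, 31-40, 41-50, 51-60
group_age = pd.cut(ages, bins=[0, 30, 40, 50, 60],
                   labels=['18-30', '31-40', '41-50', '51-60'])
print(group_age.tolist())  # ['41-50', '41-50', '31-40', '31-40', '18-30', '51-60']
```

Both versions place an age of exactly 30 in '18-30' and exactly 40 in '31-40', since pd.cut bins are right-closed by default.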

In [42]:
# Median Monthly Income By Job Level
MedIncome = df_clean.groupby(['JobLevel'])['MonthlyIncome'].median()
MedIncome
Out[42]:
JobLevel
1     2661.5
2     5343.0
3     9852.0
4    15992.0
5    19174.0
Name: MonthlyIncome, dtype: float64
In [43]:
# Add BelowMedIncome flag: 1 if MonthlyIncome is below the median for the employee's JobLevel, else 0
df_clean['BelowMedIncome'] = df_clean['JobLevel'].apply(lambda x: MedIncome[x])
df_clean['BelowMedIncome'] = np.where(df_clean['MonthlyIncome'] < df_clean['BelowMedIncome'], 1, 0)
In [44]:
# added group age (18-30) & overtime (yes) features
df_clean['GroupAge_Overtime'] = np.where((df_clean['GroupAge'] == '18-30') & (df_clean['OverTime'] == 'Yes'), 1, 0)

# added job level (1) & overtime (yes) features
df_clean['JobLevel_Overtime'] = np.where((df_clean['JobLevel'] == 1) & (df_clean['OverTime'] == 'Yes'), 1, 0)

# added job level (1) & below median monthly income (1) & overtime (yes)
df_clean['JobLevel_BelowMedIncome_Overtime'] = np.where((df_clean['JobLevel'] == 1) & (df_clean['BelowMedIncome'] == 1) & (df_clean['OverTime'] == 'Yes'), 1, 0)
In [45]:
df_clean
Out[45]:
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel JobRole JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate NumCompaniesWorked OverTime PercentSalaryHike PerformanceRating RelationshipSatisfaction StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager GroupAge BelowMedIncome GroupAge_Overtime JobLevel_Overtime JobLevel_BelowMedIncome_Overtime
0 41 Yes Travel_Rarely 1102 Sales 1 2 Life Sciences 2 Female 94 3 2 Sales Executive 4 Single 5993 19479 8 Yes 11 3 1 0 8 0 1 6 4 0 5 41-50 0 0 0 0
1 49 No Travel_Frequently 279 Research & Development 8 1 Life Sciences 3 Male 61 2 2 Research Scientist 2 Married 5130 24907 1 No 23 4 4 1 10 3 3 10 7 1 7 41-50 1 0 0 0
2 37 Yes Travel_Rarely 1373 Research & Development 2 2 Other 4 Male 92 2 1 Laboratory Technician 3 Single 2090 2396 6 Yes 15 3 2 0 7 3 3 0 0 0 0 31-40 1 0 1 1
3 33 No Travel_Frequently 1392 Research & Development 3 4 Life Sciences 4 Female 56 3 1 Research Scientist 3 Married 2909 23159 1 Yes 11 3 3 0 8 3 3 8 7 3 0 31-40 0 0 1 0
4 27 No Travel_Rarely 591 Research & Development 2 1 Medical 1 Male 40 3 1 Laboratory Technician 2 Married 3468 16632 9 No 12 3 4 1 6 3 3 2 2 2 2 18-30 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1465 36 No Travel_Frequently 884 Research & Development 23 2 Medical 3 Male 41 4 2 Laboratory Technician 4 Married 2571 12290 4 No 17 3 3 1 17 3 3 5 2 0 3 31-40 1 0 0 0
1466 39 No Travel_Rarely 613 Research & Development 6 1 Medical 4 Male 42 2 3 Healthcare Representative 1 Married 9991 21457 4 No 15 3 1 1 9 5 3 7 7 1 7 31-40 0 0 0 0
1467 27 No Travel_Rarely 155 Research & Development 4 3 Life Sciences 2 Male 87 4 2 Manufacturing Director 2 Married 6142 5174 1 Yes 20 4 2 1 6 0 3 6 2 0 3 18-30 0 1 0 0
1468 49 No Travel_Frequently 1023 Sales 2 3 Medical 4 Male 63 2 2 Sales Executive 2 Married 5390 13243 2 No 14 3 4 0 17 3 2 9 6 0 8 41-50 0 0 0 0
1469 34 No Travel_Rarely 628 Research & Development 8 3 Medical 2 Male 82 4 2 Laboratory Technician 3 Married 4404 10228 2 No 12 3 1 0 6 3 4 4 3 1 2 31-40 1 0 0 0

1387 rows × 36 columns

Normalization/Standardization (Z-Score Method)¶

Standardize numeric features that are approximately normally distributed; otherwise, normalize them (min-max scaling).
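One way to automate this choice is a per-column normality test, e.g. SciPy's Shapiro-Wilk. A minimal sketch on synthetic data (the alpha threshold and helper function are illustrative, not the exact rule used in this notebook):

```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import MinMaxScaler, StandardScaler

def pick_scaler(x, alpha=0.05):
    """Standardize if Shapiro-Wilk does not reject normality, else normalize."""
    _, p_value = stats.shapiro(x)
    return StandardScaler() if p_value > alpha else MinMaxScaler()

rng = np.random.default_rng(42)
skewed_col = rng.exponential(scale=5000, size=500)  # right-skewed, like MonthlyIncome

print(type(pick_scaler(skewed_col)).__name__)  # MinMaxScaler
```

A strongly skewed column like this fails the normality test, so it would be min-max normalized rather than standardized.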

In [46]:
# save data for simulation
df_simulation = df_clean.copy()
df_simulation.to_csv(f"{gd_path}/dataset/data_valid_all_sim.csv", index = False)
In [47]:
# standardization scaling
std_cols = ['Age']
for col in std_cols:
    x = df_clean[col].values.reshape(len(df_clean), 1)
    scaler = StandardScaler()
    x_scaled = scaler.fit_transform(x)
    df_clean[col] = x_scaled

# normalization scaling
nrm_cols = ['DailyRate', 'DistanceFromHome', 'HourlyRate', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', 'PercentSalaryHike', 'TotalWorkingYears', 'TrainingTimesLastYear', 'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager']
for col in nrm_cols:
    x = df_clean[col].values.reshape(len(df_clean), 1)
    scaler = MinMaxScaler()
    x_scaled = scaler.fit_transform(x)
    df_clean[col] = x_scaled

df_clean[std_cols + nrm_cols].head(10)
Out[47]:
Age DailyRate DistanceFromHome HourlyRate MonthlyIncome MonthlyRate NumCompaniesWorked PercentSalaryHike TotalWorkingYears TrainingTimesLastYear YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
0 0.536681 0.716332 0.000000 0.914286 0.263230 0.698016 0.888889 0.000000 0.235294 0.000000 0.24 0.266667 0.000000 0.357143
1 1.442111 0.126791 0.250000 0.442857 0.217651 0.915991 0.111111 0.857143 0.294118 0.500000 0.40 0.466667 0.090909 0.500000
2 0.083966 0.910458 0.035714 0.885714 0.057093 0.012007 0.666667 0.285714 0.205882 0.500000 0.00 0.000000 0.000000 0.000000
3 -0.368749 0.924069 0.071429 0.371429 0.100349 0.845796 0.111111 0.000000 0.235294 0.500000 0.32 0.466667 0.272727 0.000000
4 -1.047821 0.350287 0.035714 0.142857 0.129872 0.583688 1.000000 0.071429 0.176471 0.500000 0.08 0.133333 0.181818 0.142857
5 -0.481928 0.646848 0.035714 0.700000 0.108746 0.392217 0.000000 0.142857 0.235294 0.333333 0.28 0.466667 0.272727 0.428571
6 2.573899 0.875358 0.071429 0.728571 0.087726 0.315918 0.444444 0.642857 0.352941 0.500000 0.04 0.000000 0.000000 0.000000
7 -0.708285 0.899713 0.821429 0.528571 0.088941 0.451289 0.111111 0.785714 0.029412 0.333333 0.04 0.000000 0.000000 0.000000
8 0.197145 0.081662 0.785714 0.200000 0.449826 0.268653 0.000000 0.714286 0.294118 0.333333 0.36 0.466667 0.090909 0.571429
9 -0.029213 0.857450 0.928571 0.914286 0.223302 0.581479 0.666667 0.142857 0.500000 0.500000 0.28 0.466667 0.636364 0.500000

Feature Encoding¶

Determine which categorical features should be label-encoded (0, 1, etc.) and which should be one-hot encoded.

In [48]:
# traditional labeling
label_cols = ['Attrition', 'OverTime', 'Gender', 'PerformanceRating']
df_clean['Attrition'] = df_clean['Attrition'].map({'No' : 0, 'Yes' : 1})
df_clean['OverTime'] = df_clean['OverTime'].map({'No' : 0, 'Yes' : 1})
df_clean['Gender'] = df_clean['Gender'].map({'Female' : 0, 'Male' : 1})
df_clean['PerformanceRating'] = df_clean['PerformanceRating'].map({3 : 0, 4 : 1}) # only consist of 2 unique values

# one-hot encoding
onehot_cols = ['BusinessTravel', 'Department', 'EducationField', 'JobRole', 'MaritalStatus', 'Education', 'EnvironmentSatisfaction', 'JobInvolvement', 'JobLevel', 'JobSatisfaction', 'RelationshipSatisfaction', 'StockOptionLevel', 'WorkLifeBalance', 'GroupAge']
result_cols = []
for col in onehot_cols:
    onehot = pd.get_dummies(df_clean[col], prefix = col)
    dummies_cols = list(onehot.columns)
    result_cols = result_cols + dummies_cols
    df_clean = df_clean.join(onehot)
df_clean = df_clean.drop(onehot_cols, axis = 1)

df_clean[label_cols + result_cols].head()
Out[48]:
Attrition OverTime Gender PerformanceRating BusinessTravel_Non-Travel BusinessTravel_Travel_Frequently BusinessTravel_Travel_Rarely Department_Human Resources Department_Research & Development Department_Sales EducationField_Human Resources EducationField_Life Sciences EducationField_Marketing EducationField_Medical EducationField_Other EducationField_Technical Degree JobRole_Healthcare Representative JobRole_Human Resources JobRole_Laboratory Technician JobRole_Manager JobRole_Manufacturing Director JobRole_Research Director JobRole_Research Scientist JobRole_Sales Executive JobRole_Sales Representative ... JobLevel_1 JobLevel_2 JobLevel_3 JobLevel_4 JobLevel_5 JobSatisfaction_1 JobSatisfaction_2 JobSatisfaction_3 JobSatisfaction_4 RelationshipSatisfaction_1 RelationshipSatisfaction_2 RelationshipSatisfaction_3 RelationshipSatisfaction_4 StockOptionLevel_0 StockOptionLevel_1 StockOptionLevel_2 StockOptionLevel_3 WorkLifeBalance_1 WorkLifeBalance_2 WorkLifeBalance_3 WorkLifeBalance_4 GroupAge_18-30 GroupAge_31-40 GroupAge_41-50 GroupAge_51-60
0 1 1 0 0 False False True False False True False True False False False False False False False False False False False True False ... False True False False False False False False True True False False False True False False False True False False False False False True False
1 0 0 1 1 False True False False True False False True False False False False False False False False False False True False False ... False True False False False False True False False False False False True False True False False False False True False False False True False
2 1 1 1 0 False False True False True False False False False False True False False False True False False False False False False ... True False False False False False False True False False True False False True False False False False False True False False True False False
3 0 1 0 0 False True False False True False False True False False False False False False False False False False True False False ... True False False False False False False True False False False True False True False False False False False True False False True False False
4 0 0 1 0 False False True False True False False False False True False False False False True False False False False False False ... True False False False False False True False False False False False True False True False False False False True False True False False False

5 rows × 66 columns

Split Unseen Data¶

In [49]:
x = df_clean.drop(['Attrition'], axis = 1) # features
y = df_clean[['Attrition']] # target

# train = training data for modeling
# test = unseen data for validation
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.1, random_state = 42)
In [50]:
df_clean = pd.DataFrame(y_train).join(x_train)
df_unseen = pd.DataFrame(y_test).join(x_test)

print(df_clean.shape)
print(df_unseen.shape)
(1248, 84)
(139, 84)
In [51]:
df_clean['Attrition'].value_counts()
Out[51]:
Attrition
0    1051
1     197
Name: count, dtype: int64

Clean Data [Attrition] Attribute Counts:

  • 0 = 1051 counts
  • 1 = 197 counts
In [52]:
df_unseen['Attrition'].value_counts()
Out[52]:
Attrition
0    107
1     32
Name: count, dtype: int64

Unseen data [Attrition] Attribute Counts:

  • 0 = 107 counts
  • 1 = 32 counts
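With only about 15% positives, a plain random split can drift from this ratio; passing stratify=y to train_test_split preserves it in both partitions. A sketch on synthetic labels (not the notebook's exact split):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)
y = np.array([1] * 160 + [0] * 840)  # ~16% positive, similar to the attrition target

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.1, random_state=42, stratify=y)

# Both partitions keep the 16% positive rate.
print(round(y_tr.mean(), 2), round(y_te.mean(), 2))  # 0.16 0.16
```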

Feature Selection¶

Select the features needed for modeling based on EDA observations and other considerations such as domain knowledge.

In [53]:
df_selection = df_clean.copy()
In [54]:
x = df_selection.drop(['Attrition', 'GroupAge_18-30', 'GroupAge_31-40', 'GroupAge_41-50', 'GroupAge_51-60', 'BelowMedIncome', 'GroupAge_Overtime', 'JobLevel_Overtime', 'JobLevel_BelowMedIncome_Overtime'], axis = 1) # all features exclude new features
y = df_selection[['Attrition']] # target

# fitting the data
model = ExtraTreesClassifier(random_state = 42)
model.fit(x, y.values.ravel())  # ravel() avoids sklearn's column-vector warning
result = model.feature_importances_

# convert to dataframe
feat_importances = pd.DataFrame({'Feature' : x.columns, 'Importance' : result})
feat_importances['Feature_Display'] = feat_importances['Feature'].apply(lambda x: " ".join(re.findall('[A-Z][^A-Z]*', x)))
feat_importances['Feature_Display'] = feat_importances['Feature_Display'].apply(lambda x: x.replace("_", ": "))
feat_importances
Out[54]:
Feature Importance Feature_Display
0 Age 0.028026 Age
1 DailyRate 0.018761 Daily Rate
2 DistanceFromHome 0.023421 Distance From Home
3 Gender 0.013839 Gender
4 HourlyRate 0.020784 Hourly Rate
... ... ... ...
70 StockOptionLevel_3 0.007303 Stock Option Level: 3
71 WorkLifeBalance_1 0.010921 Work Life Balance: 1
72 WorkLifeBalance_2 0.015137 Work Life Balance: 2
73 WorkLifeBalance_3 0.014842 Work Life Balance: 3
74 WorkLifeBalance_4 0.008583 Work Life Balance: 4

75 rows × 3 columns

In [55]:
# set plot size
plt.figure(figsize = (9, 6))

# plotting the feature importances
data = feat_importances.sort_values(['Importance'], ascending = False).iloc[:20] # top 20
sns.barplot(y = 'Feature_Display', x = 'Importance', data = data, palette = 'colorblind')
plt.title('Top 20 Key Factors of Employee Attrition', fontweight = 'bold', pad = 20)
plt.xlabel('') # clear label
plt.ylabel('') # clear label

plt.tight_layout()
plt.show()

OverTime stands out as by far the most important factor in employee attrition, well ahead of every other feature.

We will build a model to predict employee attrition using the features whose importance score is at least 0.02.

Plus these additional features:

  • Department_Sales
  • Department_Research & Development
  • Department_Human Resources
  • JobSatisfaction_1
  • RelationshipSatisfaction_1
In [56]:
# filtering features
selected_features = list(feat_importances.loc[feat_importances['Importance'] >= 0.02]['Feature'])
add_features1 = ['Department_Sales', 'Department_Research & Development', 'Department_Human Resources', 'JobSatisfaction_1', 'RelationshipSatisfaction_1']
add_features2 = ['GroupAge_18-30', 'GroupAge_31-40', 'GroupAge_41-50', 'GroupAge_51-60', 'BelowMedIncome', 'GroupAge_Overtime', 'JobLevel_Overtime', 'JobLevel_BelowMedIncome_Overtime']
In [57]:
# clean data and validation of all features
df_clean_all = df_clean.copy()
df_valid_all = df_unseen.copy()

# clean and unseen data containing features with importance >= 0.02 plus the additional department/satisfaction features
df_clean_selected = df_clean[['Attrition']].join(df_clean[selected_features + add_features1])
df_valid_selected = df_unseen[['Attrition']].join(df_unseen[selected_features + add_features1])

# clean and unseen data containing features with importance >= 0.02, the additional department/satisfaction features, and the new engineered features (group age, etc.)
df_clean_selected_new = df_clean[['Attrition']].join(df_clean[selected_features + add_features1 + add_features2])
df_valid_selected_new = df_unseen[['Attrition']].join(df_unseen[selected_features + add_features1 + add_features2])
In [58]:
# check categorical data type
df[qual].dtypes
Out[58]:
Attrition                   object
BusinessTravel              object
Department                  object
EducationField              object
EmployeeNumber               int64
Gender                      object
JobRole                     object
MaritalStatus               object
Over18                      object
OverTime                    object
Education                    int64
EnvironmentSatisfaction      int64
JobInvolvement               int64
JobLevel                     int64
JobSatisfaction              int64
PerformanceRating            int64
RelationshipSatisfaction     int64
StockOptionLevel             int64
WorkLifeBalance              int64
dtype: object
In [59]:
df['Department'].value_counts()
Out[59]:
Department
Research & Development    961
Sales                     446
Human Resources            63
Name: count, dtype: int64
In [60]:
df_imp = df.copy()

df_imp['BusinessTravel'] = df_imp['BusinessTravel'].map({'Non-Travel' : 0, 'Travel_Rarely' : 1, 'Travel_Frequently' : 2})
df_imp['Gender'] = df_imp['Gender'].map({'Male' : 0, 'Female' : 1})
df_imp['OverTime'] = df_imp['OverTime'].map({'No' : 0, 'Yes' : 1})
df_imp['MaritalStatus'] = df_imp['MaritalStatus'].map({'Single' : 0, 'Married' : 1, 'Divorced' : 2})

df_imp['EducationField'] = df_imp['EducationField'].astype('category').cat.codes
df_imp['JobRole'] = df_imp['JobRole'].astype('category').cat.codes

df_imp = df_imp.drop(['Over18', 'EmployeeNumber'], axis = 1)
In [61]:
# check data type after encoding
df_imp.dtypes
Out[61]:
Age                          int64
Attrition                   object
BusinessTravel               int64
DailyRate                    int64
Department                  object
DistanceFromHome             int64
Education                    int64
EducationField                int8
EmployeeCount                int64
EnvironmentSatisfaction      int64
Gender                       int64
HourlyRate                   int64
JobInvolvement               int64
JobLevel                     int64
JobRole                       int8
JobSatisfaction              int64
MaritalStatus                int64
MonthlyIncome                int64
MonthlyRate                  int64
NumCompaniesWorked           int64
OverTime                     int64
PercentSalaryHike            int64
PerformanceRating            int64
RelationshipSatisfaction     int64
StandardHours                int64
StockOptionLevel             int64
TotalWorkingYears            int64
TrainingTimesLastYear        int64
WorkLifeBalance              int64
YearsAtCompany               int64
YearsInCurrentRole           int64
YearsSinceLastPromotion      int64
YearsWithCurrManager         int64
dtype: object
In [62]:
df_imp['Department'].nunique()
Out[62]:
3
In [63]:
dept_unique = df_imp['Department'].unique()

fig = plt.figure(figsize = (20, 6))

for i in range(0, len(dept_unique)):
    x = df_imp.loc[df_imp['Department'] == dept_unique[i]].drop(['Attrition', 'Department'], axis = 1) # feature
    y = df_imp.loc[df_imp['Department'] == dept_unique[i]]['Attrition'] # target

    feat_importance = ExtraTreesClassifier().fit(x, y).feature_importances_

    feat_imp = pd.DataFrame({'column' : x.columns, 'value' : feat_importance}).sort_values('value', ascending = False).reset_index(drop = True)
    feat_imp['column'] = feat_imp['column'].apply(lambda x: " ".join(re.findall('[A-Z][^A-Z]*', x)))

    plt.subplot(1, 3, i+1)
    sns.barplot(y = 'column', x = 'value', data = feat_imp.iloc[0:10], palette = 'colorblind')

    plt.title(f"{df_imp['Department'].unique()[i]} Department", fontweight = 'bold', fontsize = 14)
    plt.xlabel('')
    plt.ylabel('')
    plt.xlim(0.0, 0.095)

fig.suptitle("The Top 10 Key Factors of Attrition in Each Department", fontweight = 'bold', y = 1.1)
fig.text(x = 0, y = -0.05, s = "Overtime is a key factor of attrition in Sales and R&D Department, but not in HR Department. Most employees in HR Department decided to resign because of the distance from their homes. \nIn addition, most of the HR employees who resign from the company are young (under 30 years old). It should be noted that each department has a different reason why they left the company.", ha = 'left', va = 'center_baseline', color = '#555b6e', fontsize = 16)

fig.tight_layout(w_pad = 0.5)

fig.show()

Export Clean Dataset¶

In [64]:
# # save the cleaned data (ready to model)

# fname_all = 'dataset/data_clean_all.csv'
# fname_selected = 'dataset/data_clean_selected.csv'
# fname_selected_new = 'dataset/data_clean_selected_new.csv'

# fnames = [fname_all, fname_selected, fname_selected_new]
# df_export = [df_clean_all, df_clean_selected, df_clean_selected_new]

# for x_fname, y_df in zip(fnames, df_export):
#     fname = f"{gd_path}/{x_fname}"
#     y_df.to_csv(fname, index = False)
In [65]:
# # save unseen data

# fname_all = 'dataset/data_valid_all.csv'
# fname_selected = 'dataset/data_valid_selected.csv'
# fname_selected_new = 'dataset/data_valid_selected_new.csv'

# fnames = [fname_all, fname_selected, fname_selected_new]
# df_export = [df_valid_all, df_valid_selected, df_valid_selected_new]

# for x_fname, y_df in zip(fnames, df_export):
#     fname = f"{gd_path}/{x_fname}"
#     y_df.to_csv(fname, index = False)

Taking a Glance at Columns and Data Types

In [66]:
# Let's see the list of columns
df.dtypes
Out[66]:
Age                          int64
Attrition                   object
BusinessTravel              object
DailyRate                    int64
Department                  object
DistanceFromHome             int64
Education                    int64
EducationField              object
EmployeeCount                int64
EmployeeNumber               int64
EnvironmentSatisfaction      int64
Gender                      object
HourlyRate                   int64
JobInvolvement               int64
JobLevel                     int64
JobRole                     object
JobSatisfaction              int64
MaritalStatus               object
MonthlyIncome                int64
MonthlyRate                  int64
NumCompaniesWorked           int64
Over18                      object
OverTime                    object
PercentSalaryHike            int64
PerformanceRating            int64
RelationshipSatisfaction     int64
StandardHours                int64
StockOptionLevel             int64
TotalWorkingYears            int64
TrainingTimesLastYear        int64
WorkLifeBalance              int64
YearsAtCompany               int64
YearsInCurrentRole           int64
YearsSinceLastPromotion      int64
YearsWithCurrManager         int64
dtype: object

MODELLING THE DATA (M OF OSEMN)¶

🧠 Construct models that can predict and forecast¶

Completion Target: Week 7

PREDICTIVE ANALYTICS¶

INTERPRETING THE DATA (N OF OSEMN)¶

Put the results into good use¶

Completion Target: Week 8

PRESCRIPTIVE ANALYTICS¶